In [4]:
!pip install azureml.opendatasets
Collecting azureml.opendatasets
  Downloading azureml_opendatasets-1.41.0-py3-none-any.whl (1.3 MB)
     |████████████████████████████████| 1.3 MB 5.3 MB/s 
Requirement already satisfied: numpy<=2.0.0,>=1.16.0 in /usr/local/lib/python3.7/dist-packages (from azureml.opendatasets) (1.21.6)
Collecting azureml-core~=1.41.0
  Downloading azureml_core-1.41.0.post3-py3-none-any.whl (2.7 MB)
     |████████████████████████████████| 2.7 MB 29.0 MB/s 
Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
     |████████████████████████████████| 281.4 MB 36 kB/s 
Collecting azureml-dataset-runtime[fuse,pandas]~=1.41.0
  Downloading azureml_dataset_runtime-1.41.0-py3-none-any.whl (3.5 kB)
Requirement already satisfied: pandas<=2.0.0,>=0.21.0 in /usr/local/lib/python3.7/dist-packages (from azureml.opendatasets) (1.3.5)
Requirement already satisfied: pyarrow>=0.16.0 in /usr/local/lib/python3.7/dist-packages (from azureml.opendatasets) (6.0.1)
Requirement already satisfied: scipy<=2.0.0,>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from azureml.opendatasets) (1.4.1)
Collecting azureml-telemetry~=1.41.0
  Downloading azureml_telemetry-1.41.0-py3-none-any.whl (31 kB)
Collecting pyopenssl<23.0.0
  Downloading pyOpenSSL-22.0.0-py2.py3-none-any.whl (55 kB)
     |████████████████████████████████| 55 kB 4.1 MB/s 
Collecting pkginfo
  Downloading pkginfo-1.8.2-py2.py3-none-any.whl (26 kB)
Collecting ndg-httpsclient<=0.5.1
  Downloading ndg_httpsclient-0.5.1-py3-none-any.whl (34 kB)
Collecting azure-core<=1.22.1
  Downloading azure_core-1.22.1-py3-none-any.whl (178 kB)
     |████████████████████████████████| 178 kB 30.2 MB/s 
Requirement already satisfied: requests[socks]<3.0.0,>=2.19.1 in /usr/local/lib/python3.7/dist-packages (from azureml-core~=1.41.0->azureml.opendatasets) (2.23.0)
Collecting msal<2.0.0,>=1.15.0
  Downloading msal-1.17.0-py2.py3-none-any.whl (79 kB)
     |████████████████████████████████| 79 kB 8.1 MB/s 
Requirement already satisfied: python-dateutil<3.0.0,>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from azureml-core~=1.41.0->azureml.opendatasets) (2.8.2)
Collecting jmespath<1.0.0
  Downloading jmespath-0.10.0-py2.py3-none-any.whl (24 kB)
Collecting azure-mgmt-containerregistry<10,>=8.2.0
  Downloading azure_mgmt_containerregistry-9.1.0-py3-none-any.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 31.2 MB/s 
Collecting PyJWT<3.0.0
  Downloading PyJWT-2.4.0-py3-none-any.whl (18 kB)
Collecting jsonpickle<3.0.0
  Downloading jsonpickle-2.2.0-py2.py3-none-any.whl (39 kB)
Requirement already satisfied: pytz in /usr/local/lib/python3.7/dist-packages (from azureml-core~=1.41.0->azureml.opendatasets) (2022.1)
Collecting argcomplete<3
  Downloading argcomplete-2.0.0-py2.py3-none-any.whl (37 kB)
Collecting azure-mgmt-keyvault<10.0.0,>=0.40.0
  Downloading azure_mgmt_keyvault-9.3.0-py2.py3-none-any.whl (412 kB)
     |████████████████████████████████| 412 kB 25.7 MB/s 
Collecting azure-graphrbac<1.0.0,>=0.40.0
  Downloading azure_graphrbac-0.61.1-py2.py3-none-any.whl (141 kB)
     |████████████████████████████████| 141 kB 24.6 MB/s 
Collecting azure-mgmt-authorization<3,>=0.40.0
  Downloading azure_mgmt_authorization-2.0.0-py2.py3-none-any.whl (465 kB)
     |████████████████████████████████| 465 kB 25.7 MB/s 
Collecting knack~=0.9.0
  Downloading knack-0.9.0-py3-none-any.whl (59 kB)
     |████████████████████████████████| 59 kB 6.3 MB/s 
Collecting humanfriendly<11.0,>=4.7
  Downloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)
     |████████████████████████████████| 86 kB 4.9 MB/s 
Collecting SecretStorage<4.0.0
  Downloading SecretStorage-3.3.2-py3-none-any.whl (15 kB)
Collecting paramiko<3.0.0,>=2.0.8
  Downloading paramiko-2.10.4-py2.py3-none-any.whl (212 kB)
     |████████████████████████████████| 212 kB 23.6 MB/s 
Collecting msal-extensions<0.4,>=0.3.0
  Downloading msal_extensions-0.3.1-py2.py3-none-any.whl (18 kB)
Collecting azure-mgmt-storage<20.0.0,>=16.0.0
  Downloading azure_mgmt_storage-19.1.0-py3-none-any.whl (1.8 MB)
     |████████████████████████████████| 1.8 MB 23.3 MB/s 
Requirement already satisfied: urllib3<=1.26.7,>=1.23 in /usr/local/lib/python3.7/dist-packages (from azureml-core~=1.41.0->azureml.opendatasets) (1.24.3)
Requirement already satisfied: contextlib2<22.0.0 in /usr/local/lib/python3.7/dist-packages (from azureml-core~=1.41.0->azureml.opendatasets) (0.5.5)
Collecting pathspec<1.0.0
  Downloading pathspec-0.9.0-py2.py3-none-any.whl (31 kB)
Collecting azure-common<2.0.0,>=1.1.12
  Downloading azure_common-1.1.28-py2.py3-none-any.whl (14 kB)
Collecting msrest<1.0.0,>=0.5.1
  Downloading msrest-0.6.21-py2.py3-none-any.whl (85 kB)
     |████████████████████████████████| 85 kB 3.9 MB/s 
Requirement already satisfied: packaging<22.0,>=20.0 in /usr/local/lib/python3.7/dist-packages (from azureml-core~=1.41.0->azureml.opendatasets) (21.3)
Collecting azure-mgmt-resource<21.0.0,>=15.0.0
  Downloading azure_mgmt_resource-20.1.0-py3-none-any.whl (2.3 MB)
     |████████████████████████████████| 2.3 MB 30.2 MB/s 
Collecting docker<6.0.0
  Downloading docker-5.0.3-py2.py3-none-any.whl (146 kB)
     |████████████████████████████████| 146 kB 24.5 MB/s 
Collecting cryptography!=1.9,!=2.0.*,!=2.1.*,!=2.2.*,<37.0.0
  Downloading cryptography-36.0.2-cp36-abi3-manylinux_2_24_x86_64.whl (3.6 MB)
     |████████████████████████████████| 3.6 MB 27.0 MB/s 
Collecting adal<=1.2.7,>=1.2.0
  Downloading adal-1.2.7-py2.py3-none-any.whl (55 kB)
     |████████████████████████████████| 55 kB 3.7 MB/s 
Collecting msrestazure<=0.6.4,>=0.4.33
  Downloading msrestazure-0.6.4-py2.py3-none-any.whl (40 kB)
     |████████████████████████████████| 40 kB 3.1 MB/s 
Collecting backports.tempfile
  Downloading backports.tempfile-1.0-py2.py3-none-any.whl (4.4 kB)
Requirement already satisfied: importlib-metadata<5,>=0.23 in /usr/local/lib/python3.7/dist-packages (from argcomplete<3->azureml-core~=1.41.0->azureml.opendatasets) (4.11.3)
Requirement already satisfied: six>=1.11.0 in /usr/local/lib/python3.7/dist-packages (from azure-core<=1.22.1->azureml-core~=1.41.0->azureml.opendatasets) (1.15.0)
Collecting azure-mgmt-core<2.0.0,>=1.2.0
  Downloading azure_mgmt_core-1.3.0-py2.py3-none-any.whl (25 kB)
Collecting pyarrow>=0.16.0
  Downloading pyarrow-3.0.0-cp37-cp37m-manylinux2014_x86_64.whl (20.7 MB)
     |████████████████████████████████| 20.7 MB 2.6 MB/s 
Collecting azureml-dataprep<3.2.0a,>=3.1.0a
  Downloading azureml_dataprep-3.1.3-py3-none-any.whl (38.6 MB)
     |████████████████████████████████| 38.6 MB 2.0 MB/s 
Collecting fusepy<4.0.0,>=3.0.1
  Downloading fusepy-3.0.1.tar.gz (11 kB)
Collecting azure-identity==1.7.0
  Downloading azure_identity-1.7.0-py2.py3-none-any.whl (129 kB)
     |████████████████████████████████| 129 kB 37.3 MB/s 
Collecting azureml-dataprep-native<39.0.0,>=38.0.0
  Downloading azureml_dataprep_native-38.0.0-cp37-cp37m-manylinux1_x86_64.whl (1.3 MB)
     |████████████████████████████████| 1.3 MB 46.7 MB/s 
Requirement already satisfied: cloudpickle<3.0.0,>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime[fuse,pandas]~=1.41.0->azureml.opendatasets) (1.3.0)
Collecting dotnetcore2<3.0.0,>=2.1.14
  Downloading dotnetcore2-2.1.23-py3-none-manylinux1_x86_64.whl (29.3 MB)
     |████████████████████████████████| 29.3 MB 12.9 MB/s 
Collecting azureml-dataprep-rslex~=2.5.0dev0
  Downloading azureml_dataprep_rslex-2.5.4-cp37-cp37m-manylinux2010_x86_64.whl (15.4 MB)
     |████████████████████████████████| 15.4 MB 15.2 MB/s 
Collecting applicationinsights
  Downloading applicationinsights-0.11.10-py2.py3-none-any.whl (55 kB)
     |████████████████████████████████| 55 kB 3.1 MB/s 
Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.7/dist-packages (from cryptography!=1.9,!=2.0.*,!=2.1.*,!=2.2.*,<37.0.0->azureml-core~=1.41.0->azureml.opendatasets) (1.15.0)
Requirement already satisfied: pycparser in /usr/local/lib/python3.7/dist-packages (from cffi>=1.12->cryptography!=1.9,!=2.0.*,!=2.1.*,!=2.2.*,<37.0.0->azureml-core~=1.41.0->azureml.opendatasets) (2.21)
Collecting websocket-client>=0.32.0
  Downloading websocket_client-1.3.2-py3-none-any.whl (54 kB)
     |████████████████████████████████| 54 kB 2.4 MB/s 
Collecting distro>=1.2.0
  Downloading distro-1.7.0-py3-none-any.whl (20 kB)
Requirement already satisfied: typing-extensions>=3.6.4 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata<5,>=0.23->argcomplete<3->azureml-core~=1.41.0->azureml.opendatasets) (4.2.0)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata<5,>=0.23->argcomplete<3->azureml-core~=1.41.0->azureml.opendatasets) (3.8.0)
Requirement already satisfied: pygments in /usr/local/lib/python3.7/dist-packages (from knack~=0.9.0->azureml-core~=1.41.0->azureml.opendatasets) (2.6.1)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.7/dist-packages (from knack~=0.9.0->azureml-core~=1.41.0->azureml.opendatasets) (3.13)
Requirement already satisfied: tabulate in /usr/local/lib/python3.7/dist-packages (from knack~=0.9.0->azureml-core~=1.41.0->azureml.opendatasets) (0.8.9)
Collecting portalocker<3,>=1.0
  Downloading portalocker-2.4.0-py2.py3-none-any.whl (16 kB)
Collecting isodate>=0.6.0
  Downloading isodate-0.6.1-py2.py3-none-any.whl (41 kB)
     |████████████████████████████████| 41 kB 546 kB/s 
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from msrest<1.0.0,>=0.5.1->azureml-core~=1.41.0->azureml.opendatasets) (2021.10.8)
Requirement already satisfied: requests-oauthlib>=0.5.0 in /usr/local/lib/python3.7/dist-packages (from msrest<1.0.0,>=0.5.1->azureml-core~=1.41.0->azureml.opendatasets) (1.3.1)
Requirement already satisfied: pyasn1>=0.1.1 in /usr/local/lib/python3.7/dist-packages (from ndg-httpsclient<=0.5.1->azureml-core~=1.41.0->azureml.opendatasets) (0.4.8)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging<22.0,>=20.0->azureml-core~=1.41.0->azureml.opendatasets) (3.0.8)
Collecting pynacl>=1.0.1
  Downloading PyNaCl-1.5.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (856 kB)
     |████████████████████████████████| 856 kB 33.0 MB/s 
Collecting bcrypt>=3.1.3
  Downloading bcrypt-3.2.2-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (62 kB)
     |████████████████████████████████| 62 kB 839 kB/s 
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests[socks]<3.0.0,>=2.19.1->azureml-core~=1.41.0->azureml.opendatasets) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests[socks]<3.0.0,>=2.19.1->azureml-core~=1.41.0->azureml.opendatasets) (3.0.4)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.7/dist-packages (from requests-oauthlib>=0.5.0->msrest<1.0.0,>=0.5.1->azureml-core~=1.41.0->azureml.opendatasets) (3.2.0)
Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in /usr/local/lib/python3.7/dist-packages (from requests[socks]<3.0.0,>=2.19.1->azureml-core~=1.41.0->azureml.opendatasets) (1.7.1)
Collecting jeepney>=0.6
  Downloading jeepney-0.8.0-py3-none-any.whl (48 kB)
     |████████████████████████████████| 48 kB 5.1 MB/s 
Collecting backports.weakref
  Downloading backports.weakref-1.0.post1-py2.py3-none-any.whl (5.2 kB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
     |████████████████████████████████| 198 kB 38.1 MB/s 
Building wheels for collected packages: fusepy, pyspark
  Building wheel for fusepy (setup.py) ... done
  Created wheel for fusepy: filename=fusepy-3.0.1-py3-none-any.whl size=10503 sha256=7463de2d1f4223b30c993ead050a4ed44d2b15f415a0331c4ca83a8f2ece4f81
  Stored in directory: /root/.cache/pip/wheels/89/07/84/a5ebfafeefbbc56ceda9d6935a54a8be7a4eccf4ea7e9bf980
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=8bccaaf03f226092e9fe9785a35b1c957a3cc9d060799276f3cfb52cf0337972
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built fusepy pyspark
Installing collected packages: PyJWT, cryptography, portalocker, msal, isodate, msrest, msal-extensions, distro, azure-core, adal, websocket-client, pyopenssl, pynacl, msrestazure, jmespath, jeepney, dotnetcore2, bcrypt, backports.weakref, azureml-dataprep-rslex, azureml-dataprep-native, azure-mgmt-core, azure-identity, azure-common, argcomplete, SecretStorage, pyarrow, pkginfo, pathspec, paramiko, ndg-httpsclient, knack, jsonpickle, humanfriendly, docker, backports.tempfile, azureml-dataprep, azure-mgmt-storage, azure-mgmt-resource, azure-mgmt-keyvault, azure-mgmt-containerregistry, azure-mgmt-authorization, azure-graphrbac, py4j, fusepy, azureml-dataset-runtime, azureml-core, applicationinsights, pyspark, azureml-telemetry, azureml.opendatasets
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 6.0.1
    Uninstalling pyarrow-6.0.1:
      Successfully uninstalled pyarrow-6.0.1
Successfully installed PyJWT-2.4.0 SecretStorage-3.3.2 adal-1.2.7 applicationinsights-0.11.10 argcomplete-2.0.0 azure-common-1.1.28 azure-core-1.22.1 azure-graphrbac-0.61.1 azure-identity-1.7.0 azure-mgmt-authorization-2.0.0 azure-mgmt-containerregistry-9.1.0 azure-mgmt-core-1.3.0 azure-mgmt-keyvault-9.3.0 azure-mgmt-resource-20.1.0 azure-mgmt-storage-19.1.0 azureml-core-1.41.0.post3 azureml-dataprep-3.1.3 azureml-dataprep-native-38.0.0 azureml-dataprep-rslex-2.5.4 azureml-dataset-runtime-1.41.0 azureml-telemetry-1.41.0 azureml.opendatasets-1.41.0 backports.tempfile-1.0 backports.weakref-1.0.post1 bcrypt-3.2.2 cryptography-36.0.2 distro-1.7.0 docker-5.0.3 dotnetcore2-2.1.23 fusepy-3.0.1 humanfriendly-10.0 isodate-0.6.1 jeepney-0.8.0 jmespath-0.10.0 jsonpickle-2.2.0 knack-0.9.0 msal-1.17.0 msal-extensions-0.3.1 msrest-0.6.21 msrestazure-0.6.4 ndg-httpsclient-0.5.1 paramiko-2.10.4 pathspec-0.9.0 pkginfo-1.8.2 portalocker-2.4.0 py4j-0.10.9.3 pyarrow-3.0.0 pynacl-1.5.0 pyopenssl-22.0.0 pyspark-3.2.1 websocket-client-1.3.2
In [5]:
!pip install azureml-dataset-runtime
Requirement already satisfied: azureml-dataset-runtime in /usr/local/lib/python3.7/dist-packages (1.41.0)
Requirement already satisfied: azureml-dataprep<3.2.0a,>=3.1.0a in /usr/local/lib/python3.7/dist-packages (from azureml-dataset-runtime) (3.1.3)
Requirement already satisfied: pyarrow<4.0.0,>=0.17.0 in /usr/local/lib/python3.7/dist-packages (from azureml-dataset-runtime) (3.0.0)
Requirement already satisfied: numpy!=1.19.3 in /usr/local/lib/python3.7/dist-packages (from azureml-dataset-runtime) (1.21.6)
Requirement already satisfied: dotnetcore2<3.0.0,>=2.1.14 in /usr/local/lib/python3.7/dist-packages (from azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (2.1.23)
Requirement already satisfied: azureml-dataprep-native<39.0.0,>=38.0.0 in /usr/local/lib/python3.7/dist-packages (from azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (38.0.0)
Requirement already satisfied: cloudpickle<3.0.0,>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (1.3.0)
Requirement already satisfied: azureml-dataprep-rslex~=2.5.0dev0 in /usr/local/lib/python3.7/dist-packages (from azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (2.5.4)
Requirement already satisfied: azure-identity==1.7.0 in /usr/local/lib/python3.7/dist-packages (from azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (1.7.0)
Requirement already satisfied: six>=1.12.0 in /usr/local/lib/python3.7/dist-packages (from azure-identity==1.7.0->azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (1.15.0)
Requirement already satisfied: cryptography>=2.5 in /usr/local/lib/python3.7/dist-packages (from azure-identity==1.7.0->azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (36.0.2)
Requirement already satisfied: msal<2.0.0,>=1.12.0 in /usr/local/lib/python3.7/dist-packages (from azure-identity==1.7.0->azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (1.17.0)
Requirement already satisfied: azure-core<2.0.0,>=1.11.0 in /usr/local/lib/python3.7/dist-packages (from azure-identity==1.7.0->azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (1.22.1)
Requirement already satisfied: msal-extensions~=0.3.0 in /usr/local/lib/python3.7/dist-packages (from azure-identity==1.7.0->azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (0.3.1)
Requirement already satisfied: requests>=2.18.4 in /usr/local/lib/python3.7/dist-packages (from azure-core<2.0.0,>=1.11.0->azure-identity==1.7.0->azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (2.23.0)
Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.7/dist-packages (from cryptography>=2.5->azure-identity==1.7.0->azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (1.15.0)
Requirement already satisfied: pycparser in /usr/local/lib/python3.7/dist-packages (from cffi>=1.12->cryptography>=2.5->azure-identity==1.7.0->azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (2.21)
Requirement already satisfied: distro>=1.2.0 in /usr/local/lib/python3.7/dist-packages (from dotnetcore2<3.0.0,>=2.1.14->azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (1.7.0)
Requirement already satisfied: PyJWT[crypto]<3,>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from msal<2.0.0,>=1.12.0->azure-identity==1.7.0->azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (2.4.0)
Requirement already satisfied: portalocker<3,>=1.0 in /usr/local/lib/python3.7/dist-packages (from msal-extensions~=0.3.0->azure-identity==1.7.0->azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (2.4.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests>=2.18.4->azure-core<2.0.0,>=1.11.0->azure-identity==1.7.0->azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (2021.10.8)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests>=2.18.4->azure-core<2.0.0,>=1.11.0->azure-identity==1.7.0->azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (1.24.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests>=2.18.4->azure-core<2.0.0,>=1.11.0->azure-identity==1.7.0->azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests>=2.18.4->azure-core<2.0.0,>=1.11.0->azure-identity==1.7.0->azureml-dataprep<3.2.0a,>=3.1.0a->azureml-dataset-runtime) (2.10)
In [6]:
from azureml.opendatasets import NycSafety

This project analyzes from a dataset of all New York City 311 service requests from 2010 to the present.¶

311 requests provided valuable information as to the kinds of service requests that occur in New York City - throughout its five boroughs.¶

The New York Department oversees arterial and residential streets in New York City, receiving reports through the 311 call center - and uses a mapping and tracking system to identify incident locations and schedule crews. One call to 311 can generate multiple repairs. Weather conditions, frigid temps, and preciptation influence how long a repair takes. One days when the weather is cooperative and there's no precipitation, crews can fill several thousand potholes.

The New York Department oversees approximately X street lights that illuminate arterial and residential streets in New York City; and performs repairs and bulb replacements in response to resident's reports of street light outages. Whenever the CDOT receives a report of an "All Out" the electrician assigned to make the repair looks at the lights in that circuit (each circuit has 8-16 lights) to make sure they're working properly.

This data is updated daily.


First, let's download the data¶

In [7]:
#" This is a package in preview.


from datetime import datetime
from dateutil import parser


end_date = parser.parse('2016-01-01')
start_date = parser.parse('2015-05-01')
safety_table = NycSafety(start_date=start_date, end_date=end_date)
nyc_safety = safety_table.to_pandas_dataframe()
/usr/local/lib/python3.7/dist-packages/azureml/opendatasets/dataaccess/_blob_accessor.py:520: Warning: Please install azureml-dataset-runtimeusing pip install azureml-dataset-runtime
  "Please install azureml-dataset-runtime" + "using pip install azureml-dataset-runtime", Warning)
[Info] read from /tmp/tmpdo5oijjj/https%3A/%2Fazureopendatastorage.azurefd.net/citydatacontainer/Safety/Release/city=NewYorkCity/part-00000-tid-7635389979391348899-42387111-75db-4000-84ca-6158481505d9-13418-3.c000.snappy.parquet
[Info] read from /tmp/tmpdo5oijjj/https%3A/%2Fazureopendatastorage.azurefd.net/citydatacontainer/Safety/Release/city=NewYorkCity/part-00001-tid-7635389979391348899-42387111-75db-4000-84ca-6158481505d9-13419-3.c000.snappy.parquet
[Info] read from /tmp/tmpdo5oijjj/https%3A/%2Fazureopendatastorage.azurefd.net/citydatacontainer/Safety/Release/city=NewYorkCity/part-00002-tid-7635389979391348899-42387111-75db-4000-84ca-6158481505d9-13420-3.c000.snappy.parquet
[Info] read from /tmp/tmpdo5oijjj/https%3A/%2Fazureopendatastorage.azurefd.net/citydatacontainer/Safety/Release/city=NewYorkCity/part-00003-tid-7635389979391348899-42387111-75db-4000-84ca-6158481505d9-13421-3.c000.snappy.parquet
[Info] read from /tmp/tmpdo5oijjj/https%3A/%2Fazureopendatastorage.azurefd.net/citydatacontainer/Safety/Release/city=NewYorkCity/part-00004-tid-7635389979391348899-42387111-75db-4000-84ca-6158481505d9-13422-3.c000.snappy.parquet
[Info] read from /tmp/tmpdo5oijjj/https%3A/%2Fazureopendatastorage.azurefd.net/citydatacontainer/Safety/Release/city=NewYorkCity/part-00005-tid-7635389979391348899-42387111-75db-4000-84ca-6158481505d9-13423-3.c000.snappy.parquet
[Info] read from /tmp/tmpdo5oijjj/https%3A/%2Fazureopendatastorage.azurefd.net/citydatacontainer/Safety/Release/city=NewYorkCity/part-00006-tid-7635389979391348899-42387111-75db-4000-84ca-6158481505d9-13424-3.c000.snappy.parquet
[Info] read from /tmp/tmpdo5oijjj/https%3A/%2Fazureopendatastorage.azurefd.net/citydatacontainer/Safety/Release/city=NewYorkCity/part-00007-tid-7635389979391348899-42387111-75db-4000-84ca-6158481505d9-13425-3.c000.snappy.parquet
In [9]:
nyc_safety
Out[9]:
dataType dataSubtype dateTime category subcategory status address latitude longitude source extendedProperties
15 Safety 311_All 2015-06-26 17:39:04 Noise - Street/Sidewalk Loud Music/Party Closed None 40.735370 -73.989969 None
28 Safety 311_All 2015-11-04 11:54:00 Water System Leak (Use Comments) (WA2) Closed None 40.705333 -73.959525 None
74 Safety 311_All 2015-06-03 11:49:22 UNSANITARY CONDITION PESTS Closed 1250 LELAND AVENUE 40.831518 -73.863392 None
75 Safety 311_All 2015-09-21 20:18:16 Illegal Parking Blocked Hydrant Closed 127 BAY 13 STREET 40.607025 -74.009133 None
98 Safety 311_All 2015-07-24 18:25:17 Illegal Parking Double Parked Blocking Vehicle Closed 335 BOWERY 40.726047 -73.991908 None
... ... ... ... ... ... ... ... ... ... ... ...
3550850 Safety 311_All 2015-05-01 11:27:53 HPD Literature Request The ABCs of Housing Closed None NaN NaN None
3550853 Safety 311_All 2015-08-08 21:17:00 Street Light Condition Street Light Out Closed None 40.692653 -73.754600 None
3550894 Safety 311_All 2015-07-06 17:39:37 Damaged Tree Branch Cracked and Will Fall Closed 383 EAST 198 STREET 40.866433 -73.886300 None
3550899 Safety 311_All 2015-05-09 22:16:15 Noise - Commercial Loud Music/Party Closed 1127 PRESIDENT STREET 40.668199 -73.952818 None
3550903 Safety 311_All 2015-12-05 08:20:00 Noise Noise: Construction Before/After Hours (NM1) Closed 33 13 STREET 40.670611 -73.996226 None

1522335 rows × 11 columns

In [10]:
nyc_safety.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1522335 entries, 15 to 3550903
Data columns (total 11 columns):
 #   Column              Non-Null Count    Dtype         
---  ------              --------------    -----         
 0   dataType            1522335 non-null  object        
 1   dataSubtype         1522335 non-null  object        
 2   dateTime            1522335 non-null  datetime64[ns]
 3   category            1522335 non-null  object        
 4   subcategory         1522256 non-null  object        
 5   status              1522335 non-null  object        
 6   address             1220051 non-null  object        
 7   latitude            1377919 non-null  float64       
 8   longitude           1377919 non-null  float64       
 9   source              0 non-null        object        
 10  extendedProperties  1522335 non-null  object        
dtypes: datetime64[ns](1), float64(2), object(8)
memory usage: 139.4+ MB
In [14]:
import pandas as pd
import folium
import requests
import pandas

Data parsing¶

In [15]:
nyc_safety['latitude'] = nyc_safety['latitude'].fillna(0)
nyc_safety['longitude'] = nyc_safety['longitude'].fillna(0)

Basic map of NYC¶

In [17]:
map_osm = folium.Map(location=[40.705920, -73.921794], zoom_start=13)
map_osm
Out[17]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Presentation of all revelant columns in the dataframe¶

In [19]:
for column in nyc_safety.columns:
  print(column)
  print(nyc_safety[column].unique())
  print("\n")
dataType
['Safety']


dataSubtype
['311_All']


dateTime
['2015-06-26T17:39:04.000000000' '2015-11-04T11:54:00.000000000'
 '2015-06-03T11:49:22.000000000' ... '2015-07-06T17:39:37.000000000'
 '2015-05-09T22:16:15.000000000' '2015-12-05T08:20:00.000000000']


category
['Noise - Street/Sidewalk' 'Water System' 'UNSANITARY CONDITION'
 'Illegal Parking' 'Noise - Residential' 'Street Condition' 'DOOR/WINDOW'
 'Maintenance or Facility' 'PAINT/PLASTER' 'Noise - Commercial'
 'FLOORING/STAIRS' 'Housing - Low Income Senior'
 'Traffic Signal Condition' 'Noise - Vehicle' 'Broken Muni Meter'
 'Overflowing Litter Baskets' 'Rodent' 'WATER LEAK' 'HEAT/HOT WATER'
 'Graffiti' 'DOF Property - Reduction Issue' 'Other Enforcement'
 'Derelict Vehicle' 'Blocked Driveway' 'Non-Residential Heat'
 'Consumer Complaint' 'ELECTRIC' 'Sewer' 'Street Light Condition' 'Noise'
 'GENERAL' 'Asbestos' 'DOF Parking - Tax Exemption' 'Damaged Tree'
 'PLUMBING' 'Miscellaneous Categories' 'DOF Property - Owner Issue'
 'New Tree Request' 'Electrical' 'Street Sign - Missing'
 'General Construction/Plumbing' 'Root/Sewer/Sidewalk Condition' 'Traffic'
 'SAFETY' 'Sanitation Condition' 'Housing Options' 'Building/Use'
 'Animal Abuse' 'Indoor Air Quality'
 'Special Projects Inspection Team (SPIT)' 'Food Establishment'
 'Noise - Park' 'Drinking' 'Noise - Helicopter' 'Taxi Complaint'
 'Sidewalk Condition' 'Construction Safety Enforcement' 'Construction'
 'Overgrown Tree/Branches' 'Dirty Conditions'
 'Unsanitary Animal Pvt Property' 'Violation of Park Rules'
 'Homeless Person Assistance' 'Missed Collection (All Materials)'
 'Benefit Card Replacement' 'Emergency Response Team (ERT)'
 'HPD Literature Request' 'Bus Stop Shelter Placement'
 'Street Sign - Damaged' 'Homeless Encampment' 'Curb Condition'
 'Standing Water' 'DOF Parking - Payment Issue' 'Derelict Vehicles'
 'Food Poisoning' 'Taxi Report' 'SCRIE' 'Dead Tree' 'Plumbing' 'APPLIANCE'
 'DOF Property - Payment Issue' 'Lead' 'For Hire Vehicle Complaint'
 'Non-Emergency Police Matter' 'DOF Property - Update Account'
 'DOF Property - RPIE Issue' 'Illegal Tree Damage' 'Harboring Bees/Wasps'
 'BEST/Site Safety' 'Air Quality' 'Day Care' 'DPR Internal'
 'Illegal Animal Kept as Pet' 'Water Conservation' 'Elevator' 'Vending'
 'Mold' 'Posting Advertisement' 'Quality of Life' 'Illegal Animal Sold'
 'Industrial Waste' 'City Vehicle Placard Complaint' 'Hazardous Materials'
 'DOF Property - Request Copy' 'Home Delivered Meal - Missed Delivery'
 'Sweeping/Missed' 'For Hire Vehicle Report' 'DOF Parking - DMV Clearance'
 'School Maintenance' 'Litter Basket / Request' 'Highway Condition'
 'Street Sign - Dangling' 'OUTSIDE BUILDING' 'Found Property'
 'Foam Ban Enforcement' 'Animal in a Park' 'Poison Ivy'
 'Building Marshals office' 'Collection Truck Noise' 'Smoking'
 'Recycling Enforcement' 'Disorderly Youth' 'Indoor Sewage'
 'Investigations and Discipline (IAD)' 'Derelict Bicycle' 'Plant'
 'Vacant Lot' 'Sweeping/Inadequate' 'X-Ray Machine/Equipment'
 'Public Payphone Complaint' 'Facades' 'Water Quality'
 'Noise - House of Worship' 'DOF Parking - Request Status'
 'Broken Parking Meter' 'Taxi Compliment' 'Cranes and Derricks'
 'Overflowing Recycling Baskets' 'ATF' 'ELEVATOR'
 'Senior Center Complaint' 'Bike/Roller/Skate Chronic' 'Elder Abuse'
 'Unsanitary Pigeon Condition' 'Bus Stop Shelter Complaint'
 'OEM Literature Request' 'Unleashed Dog' 'Bereavement Support Group'
 'Ferry Inquiry' 'Boilers' 'Illegal Fireworks'
 'DCA / DOH New License Application Request' 'Urinating in Public'
 'Scaffold Safety' 'Beach/Pool/Sauna Complaint' "Alzheimer's Care"
 'Bridge Condition' 'DOF Parking - Request Copy' 'Public Toilet'
 'Drinking Water' 'Taxpayer Advocate Inquiry' 'Advocate - Other'
 'Ferry Complaint' 'VACANT APARTMENT' 'Highway Sign - Dangling'
 'Window Guard' 'Highway Sign - Damaged' 'Panhandling'
 'Home Delivered Meal Complaint' 'Municipal Parking Facility'
 'Animal Facility - No Permit' 'Bike Rack Condition'
 'Unsanitary Animal Facility' 'Adopt-A-Basket'
 'Home Care Provider Complaint' 'Snow' 'FATF' 'Lifeguard' 'Parking Card'
 'Special Natural Area District (SNAD)' 'Advocate-Personal Exemptions'
 'Highway Sign - Missing' 'DHS Income Savings Requirement'
 'DOF Property - Property Value' 'Tattooing'
 'Case Management Agency Complaint' 'Stalled Sites' 'SRDE'
 'Forensic Engineering' 'Ferry Permit' 'NORC Complaint' 'Calorie Labeling'
 'Advocate-SCRIE/DRIE' 'Transportation Provider Complaint'
 'Legal Services Provider Complaint' 'DOF Property - City Rebate'
 'Bottled Water' 'Radioactive Material' 'Interior Demo' 'Tunnel Condition'
 'Advocate-Prop Class Incorrect' 'Building Condition' 'Squeegee' 'AGENCY'
 'Tanning' 'Advocate - RPIE' 'Advocate-Commercial Exemptions']


subcategory
['Loud Music/Party' 'Leak (Use Comments) (WA2)' 'PESTS' ...
 'Dead End Sign' 'Woodside Settlement Project' 'Pedicab Driver']


status
['Closed' 'Pending' 'Open' 'Assigned' 'Started' 'Draft' 'Unspecified']


address
[None '1250 LELAND AVENUE' '127 BAY 13 STREET' ... '679 ROCKAWAY TURNPIKE'
 '44-0-44-98 CRESCENT STREET' '60 EAST 7 STREET']


latitude
[40.73537039 40.70533331 40.83151754 ... 40.62085745 40.76977938
 40.69657653]


longitude
[-73.98996874 -73.95952516 -73.86339214 ... -74.13178925 -73.89080854
 -73.93870428]


source
[None]


extendedProperties
['']


Data Collection¶

Examine the spread of four types of categories of incidents

In [24]:
for index, value in nyc_safety.head(10000).iterrows():
  if value.category == 'Noise - Residential':
    folium.Marker(location=[value["latitude"], value["longitude"]],
            icon=folium.Icon(color='green'), popup='Noise - Residential').add_to(map_osm)
  elif value.category == 'Damaged Tree':
      folium.Marker(location=[value["latitude"], value["longitude"]],
            icon=folium.Icon(color='purple'), popup='Damaged Tree').add_to(map_osm)
  elif value.category ==  'Bike Rack Condition':
      folium.Marker(location=[value["latitude"], value["longitude"]],
            icon=folium.Icon(color='yellow'), popup='Bike Rack Condition').add_to(map_osm)
  elif value.category == 'Curb Condition':
      folium.Marker(location=[value["latitude"], value["longitude"]],
            icon=folium.Icon(color='black'), popup='Curb Condition').add_to(map_osm)
In [25]:
map_osm
Out[25]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [26]:
alt_map = folium.Map(location=[40.705920, -73.921794], tiles="Stamen Terrain", zoom_start=12.5)
alt_map
Out[26]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [28]:
for index, value in nyc_safety.head(10000).iterrows():
  if value.category == 'WATER LEAK':
    if value.subcategory == 'HEAVY FLOW':
      folium.Marker(location=[value["latitude"], value["longitude"]],
              icon=folium.Icon(color='green'), popup="Heavy Flow").add_to(alt_map)
    elif value.subcategory == 'SLOW LEAK':
        folium.Marker(location=[value["latitude"], value["longitude"]],
              icon=folium.Icon(color='purple'), popup="Slow Leak").add_to(alt_map)
    elif value.subcategory == 'DAMP SPOT':
        folium.Marker(location=[value["latitude"], value["longitude"]],
              icon=folium.Icon(color='yellow'), popup="Damp Spot").add_to(alt_map)
alt_map
Out[28]:
Make this Notebook Trusted to load map: File -> Trust Notebook

More data collection¶

Four more categories on an alternate map

In [29]:
for index, value in nyc_safety.head(10000).iterrows():
  if value.category == 'PLUMBING':
    if value.subcategory == 'TOILET':
      folium.Marker(location=[value["latitude"], value["longitude"]],
              icon=folium.Icon(color='green'), popup="toilet").add_to(alt_map)
    elif value.subcategory == 'BATHTUB/SHOWER':
        folium.Marker(location=[value["latitude"], value["longitude"]],
              icon=folium.Icon(color='purple'), popup="Bathtub/Shower").add_to(alt_map)
    elif value.subcategory == 'BASIN/SINK':
        folium.Marker(location=[value["latitude"], value["longitude"]],
              icon=folium.Icon(color='red'), popup="Basin/Sink").add_to(alt_map)
alt_map
Out[29]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Numerous subcategories of a particular service request¶

In [30]:
for index, value in nyc_safety.head(10000).iterrows():
  if value.category == 'Street Condition':
    if value.subcategory == 'Pothole':
      folium.Marker(location=[value["latitude"], value["longitude"]],
              icon=folium.Icon(color='green'), popup="pothole").add_to(alt_map)
    elif value.subcategory == 'Defective Hardware':
        folium.Marker(location=[value["latitude"], value["longitude"]],
              icon=folium.Icon(color='purple'), popup="defective hardware").add_to(alt_map)
    elif value.subcategory == 'Rough, Pitted or Cracked Roads':
        folium.Marker(location=[value["latitude"], value["longitude"]],
              icon=folium.Icon(color='red'), popup="rough, pitted or cracked roads").add_to(alt_map)
    elif value.subcategory == 'Cave-in':
        folium.Marker(location=[value["latitude"], value["longitude"]],
              icon=folium.Icon(color='black'), popup="cave-in").add_to(alt_map)
    elif value.subcategory == 'Line/Marking - Faded':
        folium.Marker(location=[value["latitude"], value["longitude"]],
              icon=folium.Icon(color='white'), popup="line/marking - faded").add_to(alt_map)
    
alt_map
Out[30]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [38]:
!pip install geopandas
Collecting geopandas
  Downloading geopandas-0.10.2-py2.py3-none-any.whl (1.0 MB)
     |████████████████████████████████| 1.0 MB 5.2 MB/s 
Requirement already satisfied: shapely>=1.6 in /usr/local/lib/python3.7/dist-packages (from geopandas) (1.8.1.post1)
Requirement already satisfied: pandas>=0.25.0 in /usr/local/lib/python3.7/dist-packages (from geopandas) (1.3.5)
Collecting fiona>=1.8
  Downloading Fiona-1.8.21-cp37-cp37m-manylinux2014_x86_64.whl (16.7 MB)
     |████████████████████████████████| 16.7 MB 39.2 MB/s 
Collecting pyproj>=2.2.0
  Downloading pyproj-3.2.1-cp37-cp37m-manylinux2010_x86_64.whl (6.3 MB)
     |████████████████████████████████| 6.3 MB 38.6 MB/s 
Requirement already satisfied: click>=4.0 in /usr/local/lib/python3.7/dist-packages (from fiona>=1.8->geopandas) (7.1.2)
Collecting munch
  Downloading munch-2.5.0-py2.py3-none-any.whl (10 kB)
Collecting cligj>=0.5
  Downloading cligj-0.7.2-py3-none-any.whl (7.1 kB)
Requirement already satisfied: six>=1.7 in /usr/local/lib/python3.7/dist-packages (from fiona>=1.8->geopandas) (1.15.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from fiona>=1.8->geopandas) (57.4.0)
Requirement already satisfied: attrs>=17 in /usr/local/lib/python3.7/dist-packages (from fiona>=1.8->geopandas) (21.4.0)
Requirement already satisfied: certifi in /usr/local/lib/python3.7/dist-packages (from fiona>=1.8->geopandas) (2021.10.8)
Collecting click-plugins>=1.0
  Downloading click_plugins-1.1.1-py2.py3-none-any.whl (7.5 kB)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.25.0->geopandas) (1.21.6)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.25.0->geopandas) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.25.0->geopandas) (2022.1)
Installing collected packages: munch, cligj, click-plugins, pyproj, fiona, geopandas
Successfully installed click-plugins-1.1.1 cligj-0.7.2 fiona-1.8.21 geopandas-0.10.2 munch-2.5.0 pyproj-3.2.1
In [39]:
import geopandas as gpd
import folium
import matplotlib.pyplot as plt
In [40]:
path = gpd.datasets.get_path('nybb')
df = gpd.read_file(path)
df
Out[40]:
BoroCode BoroName Shape_Leng Shape_Area geometry
0 5 Staten Island 330470.010332 1.623820e+09 MULTIPOLYGON (((970217.022 145643.332, 970227....
1 4 Queens 896344.047763 3.045213e+09 MULTIPOLYGON (((1029606.077 156073.814, 102957...
2 3 Brooklyn 741080.523166 1.937479e+09 MULTIPOLYGON (((1021176.479 151374.797, 102100...
3 1 Manhattan 359299.096471 6.364715e+08 MULTIPOLYGON (((981219.056 188655.316, 980940....
4 2 Bronx 464392.991824 1.186925e+09 MULTIPOLYGON (((1012821.806 229228.265, 101278...

Exploratory data analysis¶

Now let's zone in by creating the boundaries of the five boroughs of New York City.

In [41]:
df.plot(figsize=(6, 6))
plt.show()
In [42]:
df.crs
Out[42]:
<Projected CRS: EPSG:2263>
Name: NAD83 / New York Long Island (ftUS)
Axis Info [cartesian]:
- X[east]: Easting (US survey foot)
- Y[north]: Northing (US survey foot)
Area of Use:
- name: United States (USA) - New York - counties of Bronx; Kings; Nassau; New York; Queens; Richmond; Suffolk.
- bounds: (-74.26, 40.47, -71.8, 41.3)
Coordinate Operation:
- name: SPCS83 New York Long Island zone (US Survey feet)
- method: Lambert Conic Conformal (2SP)
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
In [43]:
df = df.to_crs(epsg=4326)
print(df.crs)
df
epsg:4326
Out[43]:
BoroCode BoroName Shape_Leng Shape_Area geometry
0 5 Staten Island 330470.010332 1.623820e+09 MULTIPOLYGON (((-74.05051 40.56642, -74.05047 ...
1 4 Queens 896344.047763 3.045213e+09 MULTIPOLYGON (((-73.83668 40.59495, -73.83678 ...
2 3 Brooklyn 741080.523166 1.937479e+09 MULTIPOLYGON (((-73.86706 40.58209, -73.86769 ...
3 1 Manhattan 359299.096471 6.364715e+08 MULTIPOLYGON (((-74.01093 40.68449, -74.01193 ...
4 2 Bronx 464392.991824 1.186925e+09 MULTIPOLYGON (((-73.89681 40.79581, -73.89694 ...
In [45]:
m = folium.Map(location=[40.70, -73.94], zoom_start=10, tiles='CartoDB positron')
for _, r in df.iterrows():
    # Without simplifying the representation of each borough,
    # the map might not be displayed
    sim_geo = gpd.GeoSeries(r['geometry']).simplify(tolerance=0.001)
    geo_j = sim_geo.to_json()
    geo_j = folium.GeoJson(data=geo_j,
                           style_function=lambda x: {'fillColor': 'orange'})
    folium.Popup(r['BoroName']).add_to(geo_j)
    geo_j.add_to(m)
m
Out[45]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Hypothesis testing¶

Begin to examine a trend in where specific kinds of service requests show up. Specifically, try to predit which borough a service incident will show up in.

In [49]:
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="my-application")
In [ ]:
m2 = folium.Map(location=[40.70, -73.94], zoom_start=10, tiles='CartoDB positron')
for _, r in df.iterrows():
    # Without simplifying the representation of each borough,
    # the map might not be displayed
    sim_geo = gpd.GeoSeries(r['geometry']).simplify(tolerance=0.001)
    geo_j = sim_geo.to_json()
    geo_j = folium.GeoJson(data=geo_j,
                           style_function=lambda x: {'fillColor': 'orange'})
    folium.Popup(r['BoroName']).add_to(geo_j)
    geo_j.add_to(m2)
m2
In [51]:
import numpy as np
In [52]:
df1 = pd.DataFrame([ ['Staten Island', 0,0,0,0] , ['Queens',0,0,0,0] , ['Brooklyn', 0,0,0,0] , ['Manhattan',0,0,0,0] , ['The Bronx', 0,0,0,0]  ],   
    columns=['Borough', 'Loud Music/Party', 'Banging/Pounding', 'Loud Talking', 'Loud Television']) 
df1
Out[52]:
Borough Loud Music/Party Banging/Pounding Loud Talking Loud Television
0 Staten Island 0 0 0 0
1 Queens 0 0 0 0
2 Brooklyn 0 0 0 0
3 Manhattan 0 0 0 0
4 The Bronx 0 0 0 0

Boroughs are marked if the service request occurs at that location.¶

In [53]:
def funct(event):
  if loc.find("Staten Island") != -1:
    df1.at[0, event]+=1
    folium.Marker(location=[value["latitude"], value["longitude"]],
    icon=folium.Icon(color='purple'), popup=event + '@Staten Island').add_to(m)
  elif loc.find("Queens")  != -1:
    df1.at[1, event]+=1
    folium.Marker(location=[value["latitude"], value["longitude"]],
    icon=folium.Icon(color='blue'), popup=event + '@Queens').add_to(m)
  elif loc.find("Brooklyn") != -1:
    df1.at[2, event]+=1
    folium.Marker(location=[value["latitude"], value["longitude"]],
    icon=folium.Icon(color='red'), popup=event + '@Brooklyn').add_to(m)])
  elif loc.find("Manhattan") != -1:
    df1.at[3, event]+=1
    folium.Marker(location=[value["latitude"], value["longitude"]],
    icon=folium.Icon(color='yellow'), popup=event + '@Manhattan').add_to(m)
  elif loc.find("Bronx") != -1:
    df1.at[4, event]+=1
    folium.Marker(location=[value["latitude"], value["longitude"]],
    icon=folium.Icon(color='green'), popup=event + '@Bronx').add_to(m)


for index, value in nyc_safety.head(1000).iterrows():
  
  if value.category == 'Noise - Residential':
    
    location = geolocator.reverse(    str(value["latitude"]) + ", " + str(value["longitude"])  )
    loc = str(location)
    
    if value.subcategory == 'Loud Music/Party':
      funct('Loud Music/Party')
    elif value.subcategory == 'Banging/Pounding':
      funct('Banging/Pounding')
    elif value.subcategory == 'Loud Talking':
      funct('Loud Talking')
    elif value.subcategory == 'Loud Television':
      funct('Loud Television')

How do these four categories geographically occur in the city?¶

The database reveals the answer by number of occurences

In [54]:
df1
Out[54]:
Borough Loud Music/Party Banging/Pounding Loud Talking Loud Television
0 Staten Island 4 1 1 0
1 Queens 17 5 0 1
2 Brooklyn 17 6 4 2
3 Manhattan 16 3 2 0
4 The Bronx 14 6 1 0

A bar graph showing the distribution of a service incidents for dataframe 1¶

In [57]:
df1.plot.bar(x='Borough', rot=90)
plt.xticks(rotation='horizontal')
plt.margins(0.7)
plt.subplots_adjust(bottom=0.15)
plt.show()

A map visually showing the distribution of service incidents for dataframe 1¶

In [245]:
m
Out[245]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [58]:
df2 = pd.DataFrame(
    [ ['Staten Island', 0,0,0,0] , ['Queens',0,0,0,0] , ['Brooklyn', 0,0,0,0] , ['Manhattan',0,0,0,0] , ['The Bronx', 0,0,0,0]  ],
                   columns=['Borough', 'BATHTUB/SHOWER', 'BASIN/SINK', 'BOILER', 'TOILET']) 
                  #columns=['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'The Bronx'])
df2
Out[58]:
Borough BATHTUB/SHOWER BASIN/SINK BOILER TOILET
0 Staten Island 0 0 0 0
1 Queens 0 0 0 0
2 Brooklyn 0 0 0 0
3 Manhattan 0 0 0 0
4 The Bronx 0 0 0 0
In [59]:
def funct2(event):
  if loc.find("Staten Island") != -1:
    df2.at[0, event]+=1
    folium.Marker(location=[value["latitude"], value["longitude"]],
    icon=folium.Icon(color='purple'), popup=event + '@Staten Island').add_to(m2)
  elif loc.find("Queens")  != -1:
    df2.at[1, event]+=1
    folium.Marker(location=[value["latitude"], value["longitude"]],
    icon=folium.Icon(color='blue'), popup=event + '@Queens').add_to(m2)
  elif loc.find("Brooklyn") != -1:
    df2.at[2, event]+=1
    folium.Marker(location=[value["latitude"], value["longitude"]],
    icon=folium.Icon(color='red'), popup=event + '@Brooklyn').add_to(m2)
  elif loc.find("Manhattan") != -1:
    df2.at[3, event]+=1
    folium.Marker(location=[value["latitude"], value["longitude"]],
    icon=folium.Icon(color='orange'), popup=event + '@Manhattan').add_to(m2)
  elif loc.find("Bronx") != -1:
    df2.at[4, event]+=1
    folium.Marker(location=[value["latitude"], value["longitude"]],
    icon=folium.Icon(color='green'), popup=event + '@Bronx').add_to(m2)

for index, value in nyc_safety.head(1000).iterrows():
  
  if value.category == 'PLUMBING':
    
    location= geolocator.reverse(    str(value["latitude"]) + ", " + str(value["longitude"])  )
    loc = str(location)
    
    if value.subcategory == 'BATHTUB/SHOWER':
      funct2('BATHTUB/SHOWER')
    elif value.subcategory == 'BASIN/SINK':
      funct2('BASIN/SINK')
    elif value.subcategory == 'BOILER':
      funct2('BOILER')
    elif value.subcategory == 'TOILET':
      funct2('TOILET')
In [60]:
df2
Out[60]:
Borough BATHTUB/SHOWER BASIN/SINK BOILER TOILET
0 Staten Island 0 1 0 0
1 Queens 1 1 0 0
2 Brooklyn 0 5 1 0
3 Manhattan 0 0 0 0
4 The Bronx 2 2 0 1

A bar graph showing the distribution of service incidents for dataframe 2¶

In [61]:
df2.plot.bar(x='Borough', rot=90)
plt.xticks(rotation='horizontal')
plt.margins(0.7)
plt.subplots_adjust(bottom=0.15)
plt.show()

A map also showing the distribution of service incidents for dataframe 2¶

In [62]:
m2
Out[62]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [ ]:
m3 = folium.Map(location=[40.70, -73.94], zoom_start=10, tiles='CartoDB positron')
for _, r in df.iterrows():
    # Without simplifying the representation of each borough,
    # the map might not be displayed
    sim_geo = gpd.GeoSeries(r['geometry']).simplify(tolerance=0.001)
    geo_j = sim_geo.to_json()
    geo_j = folium.GeoJson(data=geo_j,
                           style_function=lambda x: {'fillColor': 'orange'})
    folium.Popup(r['BoroName']).add_to(geo_j)
    geo_j.add_to(m3)
m3
In [64]:
df3 = pd.DataFrame([ ['Staten Island', 0,0,0,0] , ['Queens',0,0,0,0] , ['Brooklyn', 0,0,0,0] , ['Manhattan',0,0,0,0] , ['The Bronx', 0,0,0,0]  ],
                   columns=['Borough', 'Posted Parking Sign Violation', 'Commercial Overnight Parking', 'Overnight Commercial Storage', 'Double Parked Blocking Traffic']) 
                 
df3
Out[64]:
Borough Posted Parking Sign Violation Commercial Overnight Parking Overnight Commercial Storage Double Parked Blocking Traffic
0 Staten Island 0 0 0 0
1 Queens 0 0 0 0
2 Brooklyn 0 0 0 0
3 Manhattan 0 0 0 0
4 The Bronx 0 0 0 0
In [65]:
def funct3(event):
  if loc.find("Staten Island") != -1:
    df3.at[0, event]+=1
    folium.Marker(location=[value["latitude"], value["longitude"]],
    icon=folium.Icon(color='purple'), popup=event + '@Staten Island').add_to(m3)
  elif loc.find("Queens")  != -1:
    df3.at[1, event]+=1
    folium.Marker(location=[value["latitude"], value["longitude"]],
    icon=folium.Icon(color='blue'), popup=event + '@Queens').add_to(m3)
  elif loc.find("Brooklyn") != -1:
    df3.at[2, event]+=1
    folium.Marker(location=[value["latitude"], value["longitude"]],
    icon=folium.Icon(color='red'), popup=event + '@Brooklyn').add_to(m3)
  elif loc.find("Manhattan") != -1:
    df3.at[3, event]+=1
    folium.Marker(location=[value["latitude"], value["longitude"]],
    icon=folium.Icon(color='orange'), popup=event + '@Manhattan').add_to(m3)
  elif loc.find("Bronx") != -1:
    df3.at[4, event]+=1
    folium.Marker(location=[value["latitude"], value["longitude"]],
    icon=folium.Icon(color='green'), popup=event + '@Bronx').add_to(m3)

for index, value in nyc_safety.head(1000).iterrows():
  
  if value.category == 'Illegal Parking':
    
    location= geolocator.reverse(    str(value["latitude"]) + ", " + str(value["longitude"])  )
    loc = str(location)
    
    if value.subcategory == 'Posted Parking Sign Violation':
      funct3('Posted Parking Sign Violation')
    elif value.subcategory == 'Commercial Overnight Parking':
      funct3('Commercial Overnight Parking')
    elif value.subcategory == 'Overnight Commercial Storage':
      funct3('Overnight Commercial Storage')
    elif value.subcategory == 'Double Parked Blocking Traffic':
      funct3('Double Parked Blocking Traffic')
In [66]:
df3
Out[66]:
Borough Posted Parking Sign Violation Commercial Overnight Parking Overnight Commercial Storage Double Parked Blocking Traffic
0 Staten Island 0 3 0 0
1 Queens 7 0 0 0
2 Brooklyn 5 3 1 1
3 Manhattan 1 1 0 1
4 The Bronx 2 0 1 1

A bar graph showing the distribution of service incidents for dataframe 3¶

In [67]:
import matplotlib.pyplot as plt
df3.plot.bar(x='Borough', rot=90)
plt.xticks(rotation='horizontal')
plt.margins(0.7)
plt.subplots_adjust(bottom=0.15)
plt.show()

A map visually showing the distribution of service incidents for dataframe 3¶

In [68]:
m3
Out[68]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [ ]:
m4 = folium.Map(location=[40.70, -73.94], zoom_start=10, tiles='CartoDB positron')
for _, r in df.iterrows():
    # Without simplifying the representation of each borough,
    # the map might not be displayed
    sim_geo = gpd.GeoSeries(r['geometry']).simplify(tolerance=0.001)
    geo_j = sim_geo.to_json()
    geo_j = folium.GeoJson(data=geo_j,
                           style_function=lambda x: {'fillColor': 'orange'})
    folium.Popup(r['BoroName']).add_to(geo_j)
    geo_j.add_to(m4)
m4

This predicts the borough where an incident will occur based on patterns showed in dataframes 1, 2, 3.¶

In [240]:
def threshold_func(incident): # uses df1 df2 df3
  val = max(df1.at[0, incident], df1.at[1, incident], df1.at[2, incident], df1.at[3, incident], df1.at[4, incident])
  ret = df1.loc[df1[incident] == val]['Borough']
  return ret.sample(n=1).iloc[0]
In [242]:
m4
Out[242]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [136]:
resid = nyc_safety[nyc_safety['category'] == 'Noise - Residential']

Communication of insights attained¶

The borough predictor function is run a subset of data from the original dataframe that contains just one specific kind of service request.

In [96]:
resid
Out[96]:
dataType dataSubtype dateTime category subcategory status address latitude longitude source extendedProperties
125 Safety 311_All 2015-05-29 11:03:10 Noise - Residential Loud Music/Party Closed 1050 MADISON STREET 40.690599 -73.917453 None
156 Safety 311_All 2015-05-03 02:27:22 Noise - Residential Loud Music/Party Closed 62 WILSON AVENUE 40.702362 -73.928655 None
392 Safety 311_All 2015-09-10 23:20:52 Noise - Residential Loud Music/Party Closed 3911 AVENUE S 40.611260 -73.929022 None
540 Safety 311_All 2015-09-08 22:52:44 Noise - Residential Loud Music/Party Closed 39 EAST 17 STREET 40.650008 -73.964154 None
625 Safety 311_All 2015-10-25 01:39:06 Noise - Residential Loud Music/Party Closed 63 ELLWOOD STREET 40.860580 -73.928626 None
... ... ... ... ... ... ... ... ... ... ... ...
3550056 Safety 311_All 2015-07-04 20:04:37 Noise - Residential Loud Talking Closed 312 WEBSTER AVENUE 40.633228 -73.969974 None
3550299 Safety 311_All 2015-06-12 12:36:17 Noise - Residential Loud Television Closed 310 WEST 99 STREET 40.797338 -73.972172 None
3550557 Safety 311_All 2015-06-27 20:17:56 Noise - Residential Loud Music/Party Closed 749 LAFAYETTE AVENUE 40.690804 -73.943361 None
3550574 Safety 311_All 2015-08-08 22:08:13 Noise - Residential Loud Music/Party Closed None 40.513024 -74.250696 None
3550643 Safety 311_All 2015-11-08 01:38:56 Noise - Residential Loud Music/Party Closed 409 WEST 129 STREET 40.813830 -73.951927 None

148295 rows × 11 columns

Geolocator object finds where the incident actually occurs based on its latitude and longitude.¶

In [205]:
def find_actual_borough(lat, long):
  locat = geolocator.reverse(str(lat) + ", " + str(long))
  loc = str(locat)
  if loc.find("Staten Island") != -1:
    return "Staten Island"
  elif loc.find("Queens")  != -1:
    return "Queens"
  elif loc.find("Brooklyn") != -1:
    return "Brooklyn"
  elif loc.find("Manhattan") != -1:
    return "Manhattan"
  elif loc.find("Bronx") != -1:
    return "Bronx"
  return np.NAN

Here is where the predictions occur¶

The estimated borough location of a given service request is compared with the actual borough location and thus an error/accuracy rate is able to be obtained that shows how accurate the model is at predicting the correct borough location.

In [241]:
error_rate = 0
import tqdm

for index, value in resid.head(1000).iterrows():
  predicted_borough = threshold_func(value['subcategory'])
  correct_borough = find_actual_borough(value["latitude"], value["longitude"])

  folium.Marker(location=[value["latitude"], value["longitude"]],
  icon=folium.Icon(color='purple'), popup='@' + predicted_borough).add_to(m4)

  if predicted_borough != correct_borough:
    error_rate+=1
print(error_rate/1000*100)
75.8

Plotting the accuracy for the threshold function¶

In [244]:
accuracy_rate = (1000 - error_rate)/1000 *100
print(accuracy_rate)
#plot the figures
plt.figure(figsize=(14, 8))
plt.subplot(1,2,1)
plt.title("Accuracy of borough predictor as threshold")
plt.bar(range(1), accuracy_rate)
plt.ylim(0, 100)
plt.xlabel("Threshold required")
plt.ylabel("Accuracy Percentage")
24.2
Out[244]:
Text(0, 0.5, 'Accuracy Percentage')

Plotting the error for the threshold function¶

In [248]:
#Plotting the error
#plot the figures
plt.figure(figsize=(14, 8))
plt.subplot(1,2,1)
plt.title("Error of borough predictor as threshold")
plt.bar(range(1), error_rate/1000*100)
plt.ylim(0, 100)
plt.xlabel("Threshold required")
plt.ylabel("Error Percentage")
Out[248]:
Text(0, 0.5, 'Error Percentage')

Conclusion¶

Thank you for reading this data science tutorial.